A novel approach to topological network analysis for the identification of metrics and signatures in non-small cell lung cancer

Non-small cell lung cancer (NSCLC), the primary histological form of lung cancer, accounts for about 25% of all cancer deaths, the highest share of any cancer. As NSCLC often goes undetected until symptoms appear in the late stages, it is imperative to discover more effective tumor-associated biomarkers for early diagnosis. Topological data analysis is one of the most powerful methodologies applicable to biological networks. However, current studies fail to consider the biological significance of their quantitative methods and utilize popular scoring metrics without verification, leading to low performance. To extract meaningful insights from genomic data, it is essential to understand the relationship between geometric correlations and biological function mechanisms. Through bioinformatics and network analyses, we propose a novel composite selection index, the C-Index, that best captures significant pathways and interactions in gene networks to identify biomarkers with the highest efficiency and accuracy. Furthermore, we establish a 4-gene biomarker signature that serves as a promising therapeutic target for NSCLC and personalized medicine. The C-Index and the biomarkers discovered were validated with robust machine learning models. The proposed methodology for finding top metrics can be applied to select biomarkers effectively and to diagnose many diseases early, with the potential to reshape topological network research across cancers.


Results
Stage I: Biomarker signature identification. Identifying specific and sensitive biomarkers is critical for early cancer diagnosis 18 . As the first stage of this study, we identified the top biomarkers to serve as a base for us to explore in depth the critical scoring metrics for biomarker identification in the next stage. An overview of this study and its two main stages is presented in Supplementary Fig. S1.
DEG screening and functional enrichment. In total, we identified 267 differentially expressed genes (DEGs) (p-value < 0.01 and |logFC| > 1.2) from the three datasets GSE31210, GSE33356, and GSE50081, taken from the Gene Expression Omnibus (GEO) 19 . The three datasets are summarized in Supplementary Table S1. Of the DEGs, 93 were upregulated and 174 were downregulated (Fig. 1a). To investigate the roles that the DEGs play in disease mechanisms, we examined the DEG-related pathways. Through enrichment analyses, we identified 10 enriched GO terms with FDR < 0.05 (Fig. 1c), and their z-score expression values (Fig. 1b).
Upregulated DEGs play a role in multiple pathways that promote tumorigenesis, including cell division (GO:0051301), cell cycle (GO:0007049), cytoskeleton (GO:0005856), and spindle pole (GO:0000922). Downregulated genes were significantly involved in weakened tumor defense and disruptions in signal transduction pathways, such as decreased cytokine activity (GO:0005125) and clathrin-coated endocytic vesicles (GO:0045334). Downregulated genes were also enriched in the extracellular matrix (GO:0005576 and GO:0005578), and ECM-receptor interaction (hsa04512, FDR = 0.0152) was identified through KEGG pathway analysis. Complications in the ECM-receptor interaction pathway can result in induced cancer progression and development, as ECM receptors play important roles in tumor shedding, adhesion, and degradation 20 . Our results show that the DEGs are largely connected to disease-related pathways and play potent roles in cancer onset and development.
PPI network analysis. Disease susceptibility and other disease-correlated factors are due to the perturbation of an interconnected gene network 21 , not single gene mutations in isolation. To explore the interactions between DEGs and identify intrinsic mechanisms of disease, we constructed a protein-protein interaction (PPI) network (Fig. 2a) based on our 267 DEGs to understand the topology of molecular interactions and identify the most essential top-scoring biomarkers in the network. The network was visualized using Cytoscape 22 and further analyzed with the CytoHubba algorithm 23 . The nodes in the center of the network (Fig. 2b) and the cluster modulus (Fig. 2c) are zoomed in, as these significant regions contain highly connected nodes that have great impact on the other nodes in the network. The topological scoring metrics in CytoHubba are divided into two categories: local metrics, which evaluate individual nodes, and global metrics, which evaluate the network as a whole. The local metrics include Degree, Maximal Clique Centrality (MCC), Density of Maximum Neighborhood Component (DMNC), Maximum Neighborhood Component (MNC), and Clustering Coefficient. The global metrics include Betweenness, Bottleneck, Eccentricity, Closeness, Radiality, Stress, and Edge Percolated Component (EPC). Often, literature studies utilize only one scoring method [15][16][17]. To ensure that no essential genes are missed and all possibilities are considered, we created a complete and comprehensive list of candidate biomarkers using all twelve scoring methods.
Each metric was used to select its top 10 nodes; for some metrics the cutoff was higher because several nodes shared the same ranking score. After removing genes that overlapped between metrics, we obtained 82 candidate biomarkers overall (Table 1a). To evaluate the ability of the biomarkers to distinguish between disease and control, we used the area under the receiver operating characteristic curve (AUC) to select the overall top 20 (Table 1b). This allows us to directly evaluate the biomarkers by their ability to predict disease.
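The selection step above (top nodes per metric, a wider cutoff when scores tie, then a merged candidate list without duplicates) can be sketched as follows. This is a minimal illustration, not the CytoHubba implementation: the function name `top_k_with_ties`, the gene labels, and the scores are hypothetical, and it assumes that a higher score means a higher rank for every metric.

```python
def top_k_with_ties(scores, k=10):
    """Return every node whose score is at least the k-th highest score.

    Ties at the cutoff are all kept, so the result may hold more than k nodes.
    Assumption: higher score = higher rank.
    """
    if not scores:
        return set()
    ranked = sorted(scores.values(), reverse=True)
    cutoff = ranked[min(k, len(ranked)) - 1]
    return {node for node, score in scores.items() if score >= cutoff}


# Hypothetical per-metric scores for a handful of placeholder genes.
metric_scores = {
    "Degree": {"G1": 9, "G2": 7, "G3": 7, "G4": 2},
    "Bottleneck": {"G1": 5, "G4": 4, "G5": 4, "G2": 1},
}

# Merge the per-metric selections; a set drops overlapping genes automatically.
candidates = set()
for metric_name, scores in metric_scores.items():
    candidates |= top_k_with_ties(scores, k=2)

print(sorted(candidates))
```

With `k=2`, each toy metric returns three genes because two genes tie at the cutoff, which mirrors why some metrics contribute more than 10 nodes.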
Disease prediction with multiple biomarkers simultaneously. Cancer is caused by multiple genes in a functional or signaling pathway working together in a cascade of mutations to promote tumorigenesis, and can never be caused or predicted by a single mutation or gene. Rather than only considering the capability of individual biomarkers, as is commonly done in literature studies, we further explore the concurrent use of multiple biomarkers in NSCLC prediction to increase diagnostic performance. Utilizing multiple biomarkers is more comprehensive, and may better deal with disease heterogeneity and reduce anomalies during prediction. To the best of our knowledge, the incremental usefulness of adding multiple biomarkers from different disease pathways has not been fully evaluated in other NSCLC studies. Simply using too many biomarkers may decrease performance. We therefore need to find the optimal number of biomarkers to use concurrently for the highest performance, as well as a way to evaluate the joint performance of biomarkers.
As each biomarker's expression value quantifies its relationship to the health condition of a subject, we propose the concept of Integrated AUC to calculate the AUC of the aggregated expression of biomarkers. In this study, the aggregated expression is defined as the mean expression of the group of biomarkers, as it is most suitable, but its definition may be expanded. After evaluating the performance of several different gene groupings, we found that the top 4 biomarkers AGER, CA4, RASIP1, and CAV1 together produced the highest Integrated AUC at 0.9238 (Fig. 3a). They make up our 4-gene biomarker signature for the prediction of NSCLC. Their Receiver Operating Characteristic (ROC) curves are visualized in Fig. 3c. For comparison purposes, we also calculated the AUC of the top 10 disease-correlated genes in the network (from Table 1b) combined. As expected, the 4-gene signature outperformed the 10-gene combination (Fig. 3b). This may be due to the nature of interactions between genes: certain biomarkers may not interact optimally for high performance, or biomarkers outside of the top four may play roles of lower significance. The finding of the biomarker signature and the use of Integrated AUC as an evaluation metric can be expanded on and further explored to improve the prediction accuracy of other types of cancers.
Validation of 4-gene signature by survival analysis and TCGA database. At the beginning of the study, we randomly divided the dataset into 80% for identification of biomarkers and metrics and 20% for validation. To validate the effectiveness of the 4-gene signature, we compared its performance in the validation set to that of each of its 4 individual components, as well as that of the top 10 genes combined. The 4-gene signature outperformed all of its individual genes, as well as the top 10 genes together (Supplementary Table S2), demonstrating that it is less complex and more effective than using more genes, and more powerful than using individual biomarkers alone.
To validate the effect of the 4-gene signature on NSCLC prognosis, we performed overall survival analysis with these genes using Kaplan-Meier survival plots (Fig. 4a) to examine their impact on patient survival, with a threshold of p-value < 0.01 to determine significance. Low expression of each of AGER, CA4, RASIP1, and CAV1 is associated with poor overall survival, indicating their significant roles in NSCLC prognosis.
To confirm that our results are applicable outside our data set, we used TCGA data from the GEPIA interactive website to verify that the identified genes are also effective in other NSCLC cases. Figure 4b compares gene expression in two histological types of NSCLC, lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD), with that in normal lung tissues. All four genes are significantly downregulated in cancer patients, indicating that the four-gene signature can be extended to new patients and data. According to our GO enrichment analysis, these genes play important roles in the extracellular matrix and signalling pathways. They are important for cancer detection and treatment, and can act as therapeutic targets for future drug therapy and personalized medicine.
Stage II: Critical network topological metrics. In the past decade, topological data analysis has grown into a prominent role in the oncology field. The scoring metrics are crucial to topological data analysis as they identify the most influential network nodes. Scoring metrics, however, describe only geometric relationships between nodes in the network, not their relation to disease diagnosis. Consequently, the significance and implications of the scoring methods are vastly overlooked in most biological studies, and a single metric is often employed without reasoning 24 . On the other side of the spectrum, many studies simply use all 12 metrics together, which is also ineffective 25 . The most frequently used metric is Degree, a local metric that may not capture the full extent of gene network interactions. Our foremost goal in the second stage of this work is to thoroughly analyze the performance of these quantitative methods applied to biological networks and identify the top metrics that best capture functional connectivity and biological implications. We set out to achieve this in two major phases. We first evaluate all 12 topological scoring metrics based on their diagnostic ability for effective biomarker selection. In the search for biomarkers, using a single metric may be inadequate, while using all metrics together is complex, inefficient, and possibly erroneous. Therefore, in the second phase, we further investigate the performance of using multiple metrics concurrently in diagnosis and design a powerful metric. Few previous studies have meaningfully utilized multiple network metrics concurrently to look for biomarkers 17,26 . Our investigations indicate that increasing the analytical coverage of a biological system through optimal pairings of metrics leads to more robust results.
Evaluating the performance of individual metrics. To evaluate the ability of the metrics to identify essential biomarkers, we calculated the Integrated AUC of biomarkers selected by each metric in Table 1a. For compari-
Table 1. Top biomarkers in the PPI network based on all 12 network scoring metrics. (a) Top ranked nodes found through all 12 topological scoring methods.
To measure both the node-level properties and the network-level properties, we chose to compose the top local and global metrics (which were also the top two metrics overall), Clustering Coefficient and Bottleneck (Fig. 5). 7 out of the top 10 overall genes were correctly identified, resulting in a 70% precision, and the Integrated AUC of the top 10 nodes was 0.9027. Likewise, the top 20 nodes had 70% precision, and Integrated AUC was 0.8765. Already, the composition results in a higher AUC than any of the individual metrics.
To determine whether the composition of Clustering Coefficient and Bottleneck is the most optimal, we also composed the three metrics Clustering Coefficient, Bottleneck, and Eccentricity. Although Betweenness outperformed Eccentricity, we chose to use Eccentricity because by definition Betweenness is essentially the same measure as Bottleneck and is thus not needed in the composite. With three, although the precision increased, the Integrated AUC reduced slightly. We also composed the top four metrics but obtained no improvement in AUC or precision. Considering the vastly increased complexity of adding more metrics, using two metrics together is ultimately more powerful than using three or more.
Our investigations have shown that Clustering Coefficient and Bottleneck can work together to capture the significant biomarkers in the network with the most efficiency, and we propose to compose them as a new metric, C-Index. C-Index is defined as the composition of Clustering Coefficient and Bottleneck, that is, the superset of the genes selected by the two metrics (see "Evaluation of composite metrics" in Methods). C-Index outperforms all individual metrics and other compositions. It can achieve the comprehensiveness of using all metrics while vastly decreasing the complexity and inefficiency. Clustering Coefficient and Bottleneck work together as local and global metrics to better capture the nature of cancer and are capable of identifying essential genes that are prevalent throughout the entire network as well as locally.
Most literature studies use Degree and occasionally Betweenness to find their top biomarkers 28,29 . To investigate whether the combination of these two metrics results in effective biomarker selection, we calculated the performance of their composition. We obtained shockingly poor results: the top 10 genes had 40% precision, and 0.6295 Integrated AUC, while the top 20 genes had only 25% precision, and 0.6721 Integrated AUC. This poor performance is due to several factors that are often overlooked by other studies. Degree, a local metric, only measures how many connections a node has, which may signify that it regulates many other genes. However, these highly connected "hubs" have no great overall influence in cancer-causing pathways because their neighbors are not necessarily interconnected. In contrast, the Clustering Coefficient of a network, also a local metric, measures the tendency of nodes to form densely connected communities. In biological networks, these communities signify functional modules and gene complexes that work closely together and share similar functions. Clustering Coefficient identifies key biomarkers located in significant communities that work together to induce cancer and is a much more robust local metric than Degree in locating essential genes.
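The contrast between Degree and Clustering Coefficient can be made concrete on a toy graph. The sketch below assumes the standard local clustering coefficient, 2e / (k(k - 1)), where k is the number of neighbors of a node and e is the number of edges among those neighbors; whether CytoHubba computes exactly this variant is an assumption here. The node names and edges are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical network: a star-shaped hub attached to a tightly knit triangle.
edges = [
    ("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("HUB", "D"),  # hub, neighbors unlinked
    ("X", "Y"), ("Y", "Z"), ("X", "Z"),                      # densely connected module
    ("D", "X"),                                              # bridge between the two
]
adj = build_adjacency(edges)

for node in ("HUB", "X", "Y"):
    print(node, "degree:", len(adj[node]), "clustering:", round(local_clustering(adj, node), 3))
```

In this toy network the hub has the highest Degree but a Clustering Coefficient of zero, while the lower-degree node `Y` sits in a fully connected community and scores 1.0, which is the distinction the paragraph above draws.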
Although Clustering Coefficient alone is a powerful metric, it is local and unable to capture the overall topology of the network. To gain a better view of the essential genes that play important roles in the network as a whole, Bottleneck is used as a complement to Clustering Coefficient. Bottlenecks, genes with the highest betweenness centrality, control most of the information flow and interaction between proteins and are key connectors between the regulatory pathways that are identified through Clustering Coefficient.
Together, Clustering Coefficient and Bottleneck are capable of considering multiple biological pathways, and improve on the performance of Degree from 78.75 to 88.24%. Not only does C-Index perform better than individual metrics, but it matches the performance of using all 12 metrics together with greatly reduced complexity and cost. The AUC of the overall top 10 biomarkers identified using all 12 metrics was 0.9064, as calculated in Stage I of this study. The AUC of the top 10 biomarkers identified using C-Index was 0.9027. It is as powerful as using all twelve, while much more efficient.
Our results indicate that the proposed C-Index is capable of capturing the most critical group of interacting genes for the early diagnosis of cancer. Our method of evaluating scoring methods by their diagnostic performance offers a new way of applying topological analysis to biomarker identification.

Performance of C-Index in validation set.
To validate the C-Index, we calculated the Integrated AUC of the metrics in the 20% validation set. As predicted, C-Index outperformed Degree and each of its components (Supplementary Table S3). Compared to Degree, the performance of using C-Index to select biomarkers increased by 25%. Clustering Coefficient alone improved on Degree by 22.4%. Compared to using all 82 genes found by all 12 metrics, the performance of using only genes found by C-Index is 40% higher. This demonstrates that the metrics that compose C-Index vastly improve on conventional ones, and are capable of identifying a concise list of significant genes most successfully. By combining genes with close local interactions and genes that lie on critical paths linking the network globally, C-Index can select the most representative group of biomarkers at a low computational cost for accurate diagnosis.

Machine learning validation of biomarkers and metrics.
We further validated the above results with a random forest machine learning model to evaluate their ability to distinguish cancer from non-cancer cases in the testing dataset. The model was trained on the 80% dataset and validated on the 20% dataset. The model considers not only genetic attributes, but also clinical attributes such as age, gender, and smoking status that most impact the development of NSCLC, to improve the accuracy of diagnosis. Literature studies often decouple genomic and clinical attributes in cancer prediction. To incorporate both types of attributes and provide a validation method that does not involve Integrated AUC, we designed this clinicogenomic model to explore their potential for disease diagnosis.
With the addition of the 4-gene signature, the purely clinical model was improved by 51% (Fig. 6a). Compared to using biomarkers alone, the addition of clinical variates also increased performance. The 4-gene model once again outperforms the top-10 gene model, validating the effectiveness of the signature. In addition to validating its ability to distinguish cancer from no cancer, the 4-gene model was further extended to diagnose early vs. late disease stage, as well as NSCLC stages I-IV with a multi-stage Cascading Model design that is described in greater detail in "Methods" (Fig. 6d).
C-Index outperformed the conventional metric by 25% and all the metrics together by 37% (Fig. 6b). Utilizing all metrics suffered from the existence of outliers and inaccuracy. Degree, on the other side of the spectrum, was unable to identify a complete set of genes that could accurately capture the interactions of the network and thus performed poorly.
Furthermore, validation was conducted not only on the testing set, but also on the TCGA LUSC and LUAD cancer datasets (Fig. 6c), obtained from UCSC Xena 30 , which were processed in the same manner as the GEO dataset. The pretrained 4-gene and C-Index models were assessed on their ability to discern between cancer and non-cancer cases in this new validation cohort. The 4-gene model achieved 96% accuracy, and the C-Index

Discussion
It is vitally important to identify critical biomarkers for exploring the pathogenesis of NSCLC, one of the deadliest cancers in the world. The key to improving prevention and early diagnosis is to find metrics and methods that can guide the effective yet cost efficient search of biomarkers. There are several important findings from this study. First, through a series of functional, network, and statistical analyses, we identify a 4-gene biomarker signature consisting of AGER, CA4, RASIP1, and CAV1 that can be further explored as possible therapeutic targets for drug treatment. Second, we prove that the most widely used topological scoring metric, Degree, is not the best suited for biological networks. We instead propose C-Index, a novel composite index that combines Clustering Coefficient and Bottleneck to best capture the interactions in gene networks for high efficiency and performance. Our results solidify the connection between geometric connectivity and functional connectivity. For validating the C-Index and 4-gene signature, we exploited the use of a machine learning model that considers both genomic and clinical factors concurrently.
To the best of our knowledge, this is the first study that comprehensively evaluates all 12 topological network scoring metrics and their effectiveness in identifying cancer-related biomarkers in biological networks. Compared to previous studies that solely relied on popular metrics like Degree 15-17 , selected metrics without any validation or reasoning provided, or simply used all 12 metrics together 6,25 , which was shown to be inefficient in this study, our study thoroughly evaluates all 12 metrics using our proposed Integrated AUC. Moreover, many previous studies only used one metric or lacked effective metric composition, which is not enough to accurately quantify and characterize the disease network. Our study advances upon these studies by evaluating and validating different metric compositions to identify the most effective one for biological networks: the C-Index.
The two metrics that compose C-Index, Clustering Coefficient and Bottleneck, effectively capture local and global gene interactions in the network. We hypothesize that Clustering Coefficient identifies significant communities that are most likely involved in pathways to promote tumorigenesis. Bottlenecks are critical points in the network that connect the biological pathways identified through Clustering, and may act as key signaling molecules. These two metrics work in tandem to effectively identify genes that work together in pathways to induce cancer. Additionally, we hypothesize that Degree, which only considers the immediate neighborhood of nodes 27 , may not be a robust indicator of network topology as it fails to adequately capture the connectedness and centrality of nodes within the network.
This study largely focused on biological implications. Through functional enrichment analysis, we found that the four genes in the 4-gene signature are enriched in GO terms of receptor activity, immune response, extracellular matrix, and signaling activity, which all play an important role in regulating proliferation, differentiation, and apoptosis. The significant downregulation of these four anti-tumor genes in NSCLC patients signifies that a change in their expression disrupts the tumor microenvironment, promoting tumorigenesis. These results match the significant GO and KEGG pathways identified in the first part of this work. In particular, AGER exhibits significant enrichment in immune signaling pathways. AGER has been found to be downregulated in NSCLC cells, and its overexpression has been shown to suppress cancer cell proliferation, invasion, and migration, while promoting apoptosis 26 . Therefore, its downregulation decreases the effectiveness of these inhibitory effects. Similarly, CA4, which affects the cell cycle and inhibits cell proliferation by downregulating the expression of CDK2, was found to be downregulated in NSCLC cells 31 . Its downregulation prevents the inhibition of tumor cell proliferation. The downregulation of RASIP1, which is involved in GTPase binding and cell-cell attachment, was found to impair cell-cell attachment and possibly promote cell migration in NSCLC patients 32 . Finally, CAV1, a scaffolding protein that may act as both a tumor suppressor and a promoter of metastasis, depending on the type of cancer and stage 33 , has been found to be downregulated in many tumors, including NSCLC 34 . The significant biomarkers discovered may be further explored through knock-out trials and analysis of their impacts on cancer prognosis. Results from these trials can be applied to developing drug therapies that target these biomarkers.
This study has shown that using multiple biomarkers and metrics concurrently greatly improves performance. This is because genes work together in pathways that lead to tumorigenesis, and single genes cannot cause cancer without having numerous regulatory effects on other genes through signal transduction pathways. Cancer is a complex disease caused by the interaction of multiple environmental factors and genes. It is the combined effect of all these genes in the pathway together that leads to cancer onset. The 4-gene biomarker signature and the biomarkers selected by C-Index accurately capture the nature of cancer. With further validation and refinement in other cancer datasets, they are promising for the study of biomarkers in other cancers. The C-Index greatly increases the efficiency and accuracy of future biomarker searches, allowing for the low-cost identification of biomarkers with great diagnostic capability. The findings from this study provide an experimental foundation for further exploration of the usage of PPI networks to diagnose cancers. Most importantly, our results indicate that the conventional approach to topological data analysis (TDA) in oncology is greatly ineffective.
Topological analysis is a powerful method to analyze biomedical data. Our work lends itself to further exploration in settings other than biomarkers. Possible future directions include predicting treatment responses, cellular architecture determination, tumor segmentation, and other applications of cancer data. Our research can be expanded to other types of cancers and more datasets, and our gene signature and C-Index need to be further extensively validated. The proposed methodology of finding top metrics can be extended to effectively and efficiently select biomarkers in various types of cancers, not just NSCLC, which helps to fundamentally advance the topological network research for the continuous pursuit of cancer prevention.

Methods
Stage I: Biomarker signature identification. Dataset. In total, we analyzed 547 NSCLC samples consisting of 467 lung tumor samples and 80 normal lung samples from the Gene Expression Omnibus (GEO) database 19 , a national repository of genetic information databases. In this study, we retrieved and combined three gene expression profiles (GSE31210, GSE33356, and GSE50081) to ensure greater accuracy and comprehensiveness. The datasets are summarized in Supplementary Table S1. Finally, the merged dataset was randomly divided into 80% for identification of top biomarkers and metrics, and 20% for validation.
Data preprocessing and DEG identification. GEO2R analysis was performed to detect DEGs in NSCLC tumor samples compared with normal lung samples. An initial pool of 267 statistically significant DEGs (p-value < 0.01 and |logFC| > 1.2) was identified for further analysis, and each DEG was classified as up- or downregulated. The raw gene expression values were normalized using a z-score.
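The screening rule above can be expressed in a few lines. The following is a minimal sketch, assuming the differential expression results have already been exported as one record per gene; the field names (`gene`, `logFC`, `p_value`), the function name `classify_degs`, and the example rows are hypothetical and are not GEO2R output.

```python
def classify_degs(rows, p_max=0.01, lfc_min=1.2):
    """Keep genes with p-value < p_max and |logFC| > lfc_min, split by direction.

    Thresholds default to the values stated in the text.
    Returns (upregulated_genes, downregulated_genes).
    """
    up, down = [], []
    for row in rows:
        if row["p_value"] < p_max and abs(row["logFC"]) > lfc_min:
            if row["logFC"] > 0:
                up.append(row["gene"])
            else:
                down.append(row["gene"])
    return up, down


# Hypothetical differential expression records (placeholder gene names).
rows = [
    {"gene": "GENE_A", "logFC": 2.1, "p_value": 0.0004},
    {"gene": "GENE_B", "logFC": -1.8, "p_value": 0.002},
    {"gene": "GENE_C", "logFC": -0.9, "p_value": 0.0001},  # fold change too small
    {"gene": "GENE_D", "logFC": 1.5, "p_value": 0.03},     # not significant
]

up, down = classify_degs(rows)
print("upregulated:", up)
print("downregulated:", down)
```

Both conditions must hold at once, so a gene with a very small p-value but a modest fold change (`GENE_C`) is excluded, as is a gene with a large fold change but weak significance (`GENE_D`).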
Functional enrichment analysis. We performed functional enrichment analyses using the Database for Annotation, Visualization and Integrated Discovery (DAVID) gene functional annotation tool 35,36 to identify significant Gene Ontology (GO) terms. GO terms are biological annotations that signify functional characteristics. They are divided into three main categories: molecular function (MF), cellular component (CC), and biological process (BP). Enriched terms were identified with a threshold of FDR (adjusted p-value) < 0.05; the lower the FDR, the more significant the enrichment. Statistically significant GO terms were also expressed as a z-score expression
$$ z = \frac{N_{\text{upregulated}} - N_{\text{downregulated}}}{\sqrt{N_{\text{upregulated}} + N_{\text{downregulated}}}} $$
where $N_{\text{upregulated}}$ and $N_{\text{downregulated}}$ represent the number of upregulated and downregulated genes in the term, respectively. This expression value signifies whether the GO term is more likely to be downregulated (negative value) or upregulated (positive value). We visualized the top 10 GO terms and their z-score expressions using the GOplot package (version 1.0.2) in R 37 .
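As a worked instance of the GO term z-score, the sketch below assumes the GOplot-style definition, in which the difference between the numbers of upregulated and downregulated genes in a term is divided by the square root of the number of genes counted for that term. Taking that count to be the sum of the two groups is an assumption of this sketch, and the function name `go_zscore` and the gene counts are hypothetical.

```python
import math


def go_zscore(n_up, n_down):
    """z = (n_up - n_down) / sqrt(n_up + n_down).

    Assumption: every gene counted for the term is either up- or downregulated,
    so the term's gene count equals n_up + n_down.
    """
    total = n_up + n_down
    if total == 0:
        raise ValueError("a GO term needs at least one annotated gene")
    return (n_up - n_down) / math.sqrt(total)


# Hypothetical terms: one dominated by downregulated genes, one by upregulated genes.
print(go_zscore(2, 14))  # negative: term is more likely downregulated
print(go_zscore(9, 0))   # positive: term is more likely upregulated
```

The sign carries the direction and the magnitude grows with both the imbalance and the size of the term, so a small term with a slight imbalance stays close to zero.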
String-db 38 analysis was also utilized to identify significant Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The KEGG database contains genomic information, functions, recognized pathways, and networks with higher-order functional information of various organisms 39 .
PPI network construction. We constructed a protein-protein interaction (PPI) network based on the DEGs using the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) Interactome database 38 . It is a biological database and web resource of known and predicted protein-protein interactions. The network was then input into Cytoscape 22 for further analysis with the CytoHubba 23 plugin. Using all 12 topological analysis methods, we identified our list of candidate biomarkers composed of the top genes selected by each method.
AUC performance evaluation and Integrated AUC. We evaluated the diagnostic value and functional significance of the biomarkers using AUC, which measures the trade-off between sensitivity and specificity. AUC was calculated for each gene using the pROC package in R Studio 40 , which was also used to visualize the ROC curves. Similarly, to evaluate the performance of multiple biomarkers at once, Integrated AUC was calculated by first aggregating the expression values of the biomarkers (in this study, by taking their mean) and then evaluating the AUC of that aggregated expression. To identify our 4-gene signature, we systematically aggregated the top biomarker expression values one by one and determined when the AUC reached its peak.
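The procedure above can be sketched in plain Python. This is an illustration under stated assumptions rather than the pROC-based analysis used in the study: AUC is computed with the rank-based (Mann-Whitney) formulation, the result is made direction-agnostic by taking the larger of AUC and 1 - AUC so that downregulated markers also score above 0.5, and the function names, gene labels, and expression values are all hypothetical.

```python
def auc(scores, labels):
    """Rank-based AUC: probability that a case scores above a control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def integrated_auc(expr, genes, labels):
    """AUC of the mean expression of `genes` across samples (direction-agnostic)."""
    if not genes:
        raise ValueError("at least one gene is required")
    mean_expr = [sum(expr[g][i] for g in genes) / len(genes) for i in range(len(labels))]
    a = auc(mean_expr, labels)
    return max(a, 1.0 - a)  # assumption: marker direction is not fixed in advance


def best_prefix(expr, ranked_genes, labels):
    """Add ranked genes one by one; return (k, Integrated AUC) at the first peak."""
    best_k, best_auc = 0, 0.0
    for k in range(1, len(ranked_genes) + 1):
        a = integrated_auc(expr, ranked_genes[:k], labels)
        if a > best_auc:
            best_k, best_auc = k, a
    return best_k, best_auc


# Hypothetical data: 3 tumor samples (label 1) followed by 3 normal samples (label 0).
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "G1": [1, 2, 4, 3, 5, 6],
    "G2": [3, 2, 0, 4, 3, 3],
    "G3": [9, 0, 9, 0, 9, 0],  # uninformative gene
}
print(best_prefix(expr, ["G1", "G2", "G3"], labels))
```

In this toy data the first two genes each separate the classes imperfectly, their mean separates them completely, and adding the uninformative third gene lowers the score again, which is the kind of peak the incremental search looks for.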
Survival analysis and validation of biomarkers. The Kaplan-Meier plotter database (http://kmplot.com), an online tool, was used to analyze the associations between the identified hub genes and overall survival. The overall survival (OS) plots were based on 1925 lung cancer patients from GEO and TCGA (The Cancer Genome Atlas database) datasets. For the RASIP1 gene, however, the dataset used for analysis was restricted to samples that were tested with the HGU133 Plus 2.0 microarray, as it was the only probe set available for RASIP1. Therefore, the Kaplan-Meier analysis for RASIP1 was limited to 1144 patients. The hazard ratio (HR) with 95% confidence intervals and log-rank p-value were calculated, and a p-value < 0.05 was used to indicate a statistically significant difference. The expression levels of the hub genes and their association with survival were additionally assessed using the web-based GEPIA database (http://gepia.cancer-pku.cn/) 41 with the settings of p < 0.05 and |Log2FC| > 1.

Stage II: Critical network topological metrics. Evaluation of composite metrics.
To evaluate composite metrics, we first fused the genes identified through each individual metric into a superset. Then, the Integrated AUC of the top 10 and top 20 genes of the superset was evaluated to ensure a fair assessment of each composition and to eliminate any bias from using more or fewer biomarkers.
In addition to AUC, to quantify the ability of each metric to correctly identify the biomarkers that were included in the overall top 10 and top 20 ranked biomarkers found in Table 1, we also define a Precision metric as follows: where N correct and N total represent the number of correctly identified genes and total number of top ranked genes respectively.
To find the number of correctly identified genes ( N correct ), we took the top 10 and 20 disease genes from the supersets of the genes selected by the composite metrics and calculated how many of them matched with the overall top 10 and 20 disease-correlated genes from Table 1b. These overall top 10 and 20 genes were identified by AUC score in the "PPI network analysis" section.
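The superset and Precision steps can be illustrated with a short Python sketch. It assumes that the superset is ranked by a per-gene score such as AUC before the top k genes are taken; the helper names, the single-letter gene labels, and the scores are hypothetical and are not taken from Table 1.

```python
def superset(*gene_lists):
    """Union of the genes selected by each metric, keeping first-seen order."""
    seen, merged = set(), []
    for genes in gene_lists:
        for gene in genes:
            if gene not in seen:
                seen.add(gene)
                merged.append(gene)
    return merged

def precision_at_k(candidate_ranked, reference_ranked, k):
    """Precision = N_correct / N_total, with N_total = k."""
    n_correct = len(set(candidate_ranked[:k]) & set(reference_ranked[:k]))
    return n_correct / k

# Hypothetical genes selected by two individual topological metrics.
metric_a = ["A", "B", "C"]
metric_b = ["C", "D", "E"]

# Hypothetical per-gene scores used to rank the superset.
gene_score = {"A": 0.90, "B": 0.60, "C": 0.80, "D": 0.95, "E": 0.50}
candidates = sorted(superset(metric_a, metric_b), key=gene_score.get, reverse=True)

# Hypothetical overall ranking that plays the role of the reference list.
reference = ["D", "A", "F", "C"]
precision_top3 = precision_at_k(candidates, reference, k=3)
```

Here the top three candidates are D, A, and C, of which D and A appear in the top three of the reference list, giving a Precision of 2/3. In the study, k would be 10 or 20.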
Machine learning validation and Cascading Model design. The primary focus of this study was to improve the diagnostic capability of early cancer detection. To validate our findings, we evaluated the diagnostic performances of the metrics and signatures with clinicogenomic random forest (RF) models in the 20% dataset. Among various classification models, including artificial neural networks, XGBoost models, SVM, and decision tree models, the RF models exhibited the best performance. RF was additionally found to have advantages over other classification algorithms in terms of robustness to overfitting, ability to handle non-linear data, and stability in the presence of outliers, as previously reported 42 .
Our work above focused on the performance of biomarkers and how to improve the search for them. However, compared to only using genetic information, the usage of a variety of other factors will help improve the accuracy of diagnosis. Some clinical attributes that impact the development of NSCLC are age, gender, and smoking status. In order to incorporate both clinical and genomic attributes, we explored the use of machine learning techniques to adequately consider multiple factors at once in cancer prediction to design a clinicogenomic model. The 4-gene and C-Index models were further validated in an external cohort, the TCGA LUAD and LUSC datasets. Phenotype and gene count files were obtained from the UCSC Xena database, and underwent similar preprocessing and normalization procedures as the GEO datasets. Subsequently, the pretrained models were assessed using this new dataset.
In addition to validating our models, we sought to expand the 4-gene model into a Cascading Model that not only leverages the top biomarkers for accurate prediction of cancer status, but also has the capability to differentiate between early and late stages of cancer and lung cancer stages I-IV. Our goal is to develop a precise clinicogenomic diagnostic model that can utilize the selected top biomarkers to accurately predict NSCLC disease stage. In our initial study, we extended the RF model to a multi-class model, which as expected, exhibited high accuracy in classifying cancer from non-cancer cases. However, the multi-class model showed lower accuracy in classifying cancer stages, likely due to limited data availability for stages III and IV.
To more accurately identify the stage of cancer, we propose a multi-stage Cascading Model with 3 stages, depicted in Supplementary Fig. S2. The model first classifies the data into cancer or no cancer, then further classifies those with cancer into early vs. late stages of cancer, and finally classifies early and late one step further into cancer stages I-IV. In the 4-gene model, the first classification, cancer status, had an average accuracy of 0.9553 and AUC of 0.98605. The second classification, early vs. late stages, had an average accuracy of 0.9716 and AUC of 0.9902. The cancer stages I-IV classification had an average accuracy of 0.7137, which may be due to the minor differences between patients in the four cancer stages. The difference in clinical attributes between cancer and no cancer, and between early and late stages, is much greater than the difference between the four individual stages, so the model may have difficulty distinguishing samples that lie close to a stage boundary. However, as exemplified in the performance study, the Cascading Model greatly improves the ability to accurately differentiate between multiple stages of cancer, and is one of the first capable of accurately predicting early vs. late cancer stages.
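The routing logic of the three-stage cascade can be sketched as follows. This is only a structural illustration: in the study each stage is a trained random forest, whereas the classifiers below are hypothetical threshold rules on a made-up feature `x`, and the assumption that early corresponds to stages I-II and late to stages III-IV is ours rather than stated in the text.

```python
def cascade_predict(sample, is_cancer, is_late, early_stage, late_stage):
    """Route one sample through the cascade:
    cancer vs. no cancer -> early vs. late -> individual stage."""
    if not is_cancer(sample):
        return {"cancer": False, "phase": None, "stage": None}
    if is_late(sample):
        return {"cancer": True, "phase": "late", "stage": late_stage(sample)}
    return {"cancer": True, "phase": "early", "stage": early_stage(sample)}

# Hypothetical stand-ins for the trained per-stage classifiers.
def is_cancer(sample):
    return sample["x"] > 0.5

def is_late(sample):
    return sample["x"] > 2.0

def early_stage(sample):
    # Assumed split: early phase covers stages I and II.
    return "I" if sample["x"] < 1.25 else "II"

def late_stage(sample):
    # Assumed split: late phase covers stages III and IV.
    return "III" if sample["x"] < 3.0 else "IV"

samples = [{"x": 0.1}, {"x": 1.0}, {"x": 3.5}]
predictions = [
    cascade_predict(s, is_cancer, is_late, early_stage, late_stage)
    for s in samples
]
```

Because each classifier only sees the samples passed on by the previous stage, the final stage classifiers solve two smaller two-class problems instead of one four-class problem, which is the motivation given above for cascading rather than using a single multi-class model.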

Data availability
The datasets GSE31210, GSE33356, and GSE50081 are available online from the GEO database.